Robust Tagging System for Lexicon Creation

Author

  • Anna Pappa
Abstract

This paper presents a robust rule-based shallow parsing system for part-of-speech (PoS) recognition and tagging. Unlike previous work, the system uses parsing as the basis for tagging, relying on unsupervised learning methods with no prior knowledge, training, or pre-tagged corpora. START (System of Textual Analysis Recognition and Tagging) has been evaluated on non-annotated corpora in both French and Greek, demonstrating portability and adaptability to languages with similar syntactic features. Its accuracy exceeds 92% for the recognition of noun and verb phrases and 99% for the disambiguation of ambiguous cases, such as the definite article versus the personal pronoun. After an auto-evaluation, the low error rate (less than 1%) makes the acquired knowledge trustworthy, and the parser proceeds to create a lexicon whose entries carry morpho-syntactic features for further automatic annotation of plain-text corpora.
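
The abstract does not spell out the rules START applies, so the following is only a minimal sketch of the kind of contextual rule that can separate a French definite article from a personal pronoun. The word lists, the tag names `DET` and `PRON`, and the function `tag_ambiguous` are all assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical closed-class cues: a subject clitic (or "ne") right before
# "le/la/les" suggests an object pronoun; otherwise assume a definite article.
SUBJECT_CLITICS = {"je", "tu", "il", "elle", "on", "nous", "vous", "ils", "elles", "ne"}
AMBIGUOUS = {"le", "la", "les"}


def tag_ambiguous(tokens):
    """Return (token, tag) pairs; tag is None for tokens the rule does not cover."""
    tagged = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in AMBIGUOUS:
            tagged.append((tok, None))
            continue
        # Look one token to the left; sentence-initial forms have no left context.
        prev = tokens[i - 1].lower() if i > 0 else ""
        tagged.append((tok, "PRON" if prev in SUBJECT_CLITICS else "DET"))
    return tagged


print(tag_ambiguous("je le vois".split()))
print(tag_ambiguous("le chat dort".split()))
```

A single left-context rule like this is far cruder than a shallow parser, but it shows how a small closed-class cue, rather than a pre-tagged corpus, can resolve this particular ambiguity.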

Similar resources

Grammar-based tools for the creation of tagging resources for an unresourced language: the case of Northern Sotho

We describe an architecture for the parallel construction of a tagger lexicon and an annotated reference corpus for the part-of-speech tagging of Northern Sotho, a Bantu language of South Africa, for which no tagged resources have been available so far. Our tools make use of grammatical properties (morphological and syntactic) of the language. We use symbolic pretagging, followed by stochastic t...

Feature extraction in opinion mining through Persian reviews

Opinion mining analyzes user reviews to extract opinions, sentiments and demands in a specific area, which can play an important role in making major decisions in that area. In general, opinion mining works at three levels: document, sentence and feature. Opinion mining at the feature level is taken into consideration more than the other two levels d...

Tagging Icelandic text: A linguistic rule-based approach

Icelandic is a morphologically complex language, for which a large tagset has been created. This paper describes the design of a linguistic rule-based system for part-of-speech tagging of Icelandic text. The system contains two main components: a disambiguator, IceTagger, and an unknown word guesser, IceMorphy. IceTagger uses a small number of local elimination rules along with a globa...

A Practical Part-of-Speech Tagger

We present an implementation of a part-of-speech tagger based on a hidden Markov model. The methodology enables robust and accurate tagging with few resource requirements. Only a lexicon and some unlabeled training text are required. Accuracy exceeds 96%. We describe implementation strategies and optimizations which result in high-speed operation. Three applications for tagging are described: p...

Collecting and POS-tagging a lexical resource of Japanese biomedical terms from a corpus

This paper explains the methodology followed for the creation of a morphologically tagged medical lexicon in Japanese. To build this medical resource, we have taken into account the morphosyntactic characteristics of the language as well as the origins and formation of the medical terms. Following this, we have compiled a list using the Japanese MultiMedica corpus, special tags ...

Publication date: 2006